The basic functionality of the Prophet model takes a dataframe with two variables; \(ds\) contains the date/time vector, and \(y\) contains the observed datapoints in which we want to forecast over a given time period.
Meta states they use an additive model (i.e \(x_t = m_t + s_t + y_t\)), and that it works best with data containing strong seasonal effects, and multi-seasonality.
In this document I am going to explore the functionality, use, and accuracy of Prophet.
For starters, you can use Prophet by calling
library(prophet)
Let’s use the UKgas dataset, built into R.
Before we experiment with the Prophet model, let’s just view our data.
plot(UKgas)
This should be a strong model, as there is clear seasonality and trend.
However, before we use Prophet to try and forecast some predicted values. Let’s split our dataset into 80% training and 20% testing data.
This way we can test the accuracy of the model.
Let’s first find out at which datapoint exactly we need to split for an 80-20 test.
total_length = length(UKgas)
train_length = total_length*0.8
cutoff = train_length
cutoff
## [1] 86.4
Since our data is measured in quarters, we need to convert this to
years. Additionally we need our cutoff to be a whole number, so we will
use the floor() function to round down.
year_cutoff = floor(cutoff/4)
year_cutoff
## [1] 21
So we shall assign our testing dataframe accordingly. Starting at our original start date and ending +21.5 years later.
UKgas_train = ts(UKgas,start=1960,end=1960+year_cutoff,frequency=4)
ds_train_gas = as.yearmon(time(UKgas_train))
y_train_gas = UKgas_train
df_train_gas = data.frame(ds=ds_train_gas,y=y_train_gas)
Here is a preview of our dataframe df_train_gas , we can
see now it ends in July 1981.
tail(df_train_gas)
## ds y
## 80 Oct 1979 542.7
## 81 Jan 1980 840.5
## 82 Apr 1980 414.6
## 83 Jul 1980 217.7
## 84 Oct 1980 670.8
## 85 Jan 1981 848.5
Now we can create our prophet model. To do this we simply use the
prophet() function with our dataframe as the primary
variable.
Since our data is recorded quarterly we can disable daily & weekly seasonality parameters.
m_train_gas = prophet(df=df_train_gas,daily.seasonality = FALSE, weekly.seasonality = FALSE)
Remember we want to predict the next 20% of our original data, so we can compare later. Let’s check how many quarters ahead we need to predict.
length(UKgas)-length(UKgas_train)
## [1] 23
Ok so 23 quarters ahead is our prediction range.
Now, to create your forecast you use the
make_future_dataframe() function. This function requires a
few things:
m_train_gas"quarter" to
go quarterlyf_train_gas = make_future_dataframe(m_train_gas,23,freq="quarter")
This has given us a extended \(ds\) set, including the historical \(ds\) values.
head(f_train_gas$ds); tail(f_train_gas$ds)
## [1] "1960-01-01 GMT" "1960-04-01 GMT" "1960-07-01 GMT" "1960-10-01 GMT"
## [5] "1961-01-01 GMT" "1961-04-01 GMT"
## [1] "1985-07-01 GMT" "1985-10-01 GMT" "1986-01-01 GMT" "1986-04-01 GMT"
## [5] "1986-07-01 GMT" "1986-10-01 GMT"
And so now our original dataset and our f_train_gas
future set should be the same length. Let’s check this.
length(UKgas)==length(f_train_gas$ds)
## [1] TRUE
Now we can finally predict using our prophet forecast and model,
using the base-R function predict().
p_gas = predict(m_train_gas,f_train_gas)
We now have everything we need to perform view our simple Prophet forecast. We must now plot our prophet model, and our forecasted data together.
plot(m_train_gas,p_gas)
There are a few important components to note in the prophet plot:
m_train_gas$historyp_gas$yhatp_gas$yhat_upper &
p_gas$yhat_lowerWe can see the forecast predicts the same upwards trend with the seasonality. However, there is more that the prophet model will show us. We will cover the further components in Section 4
We will now see how it compares to the actual data.
Let’s plot our whole original dataset UKgas.
plot(UKgas)
Now let’s add over our predicted values, \(\hat{y}\)
y = as.numeric(UKgas)
y_hat = p_gas$yhat
x = p_gas$ds
y_hat_upper = p_gas$yhat_upper
y_hat_lower = p_gas$yhat_lower
plot(x,y,type="line",cex=4)
## Warning in plot.xy(xy, type, ...): plot type 'line' will be truncated to first
## character
lines(x,y_hat,col="red",lty=1,cex=4)
lines(x,y_hat_upper,col="lightblue",lty=2,cex=4)
lines(x,y_hat_lower,col="lightblue",lty=2,cex=4)
legend("topleft",legend=c("Actual","Predicted","Confidence Interval"),col=c("black","red","lightblue"),lty=c(1,1,2))
We can see that the fitted model does not exactly scale with the actual date. The predictions are all largely underestimated.
The mean squares error (MSE) is a good metric to see how accurate our model has predicted the \(\hat{y}\) values. It uses the following formula: \(MSE = \frac{\Sigma((\hat{y_i}-y_i)^2)}{n}\)
MSE_gas = mean((y-y_hat)^2)
MSE_gas
## [1] 12710.19
This is quite a high error. But we don’t have any comparison to put into perspective how well the model performed.
So let’s try make another model using the same data, but attempting to reduce the heteroskedasticity which makes model prediction much harder.
Now we are going to log our data. This is one method to reduce heteroskedasticity and stabilise the variance.
log_gas = log(UKgas)
plot(log(UKgas))
Now let’s apply the Prophet forecasting model to our new log-data.
UKgas_log_train = ts(log_gas,start=1960,end=1960+year_cutoff,frequency=4)
ds_train_gas = as.yearmon(time(UKgas_train))
y_log_train_gas = UKgas_log_train
df_log_train_gas = data.frame(ds=ds_train_gas,y=y_log_train_gas)
m_log_gas = prophet(df_log_train_gas,daily.seasonality = FALSE, weekly.seasonality = FALSE)
f_log_gas = make_future_dataframe(m_log_gas,23,freq="quarter")
p_log_gas = predict(m_log_gas,f_log_gas)
plot(m_log_gas,p_log_gas)
This seems to have worked to some extent. Let’s view our predicted data with our original data.
y = as.numeric(log_gas)
y_hat = p_log_gas$yhat
y_hat_upper = p_log_gas$yhat_upper
y_hat_lower = p_log_gas$yhat_lower
x = p_gas$ds
plot(x,y,type="line",cex=4,main="Box-Cox Predicted vs Actual")
## Warning in plot.xy(xy, type, ...): plot type 'line' will be truncated to first
## character
lines(x,y_hat,col="red",lty=1,cex=4)
lines(x,y_hat_upper,col="lightblue",lty=2,cex=4)
lines(x,y_hat_lower,col="lightblue",lty=2,cex=4)
legend("topleft",legend=c("Actual","Predicted","Confidence Interval"),col=c("black","red","lightblue"),lty=c(1,1,2))
This looks like a much stronger model than our original. Let’s view one more transformation in order to get a final comparison.
The BoxCox lambda, \(\lambda\) determines whether variance increases or decreases. Let’s find this value.
lambda = BoxCox.lambda(UKgas)
lambda
## [1] -0.4457023
Using this \(\lambda\) we can apply a Box-Cox transformation.
boxcox_gas = BoxCox(UKgas,lambda)
plot(boxcox_gas)
This transformation seems to have done a much better job of decreasing the the heteroskedasticity. Let’s use this model for our MSE comparison.
We will now set up an identical prophet model as we did in Section 1, but this time we will use the BoxCox transformed data as our \(y\). Still using our 80-20 test-training split, otherwise we won’t be able to calculate an MSE value.
UKgas_boxcox_train = ts(boxcox_gas,start=1960,end=1960+year_cutoff,frequency=4)
ds_train_gas = as.yearmon(time(UKgas_train))
y_boxcox_train_gas = UKgas_boxcox_train
df_boxcox_train_gas = data.frame(ds=ds_train_gas,y=y_boxcox_train_gas)
m_boxcox_gas = prophet(df_boxcox_train_gas,daily.seasonality = FALSE, weekly.seasonality = FALSE)
f_boxcox_gas = make_future_dataframe(m_boxcox_gas,23,freq="quarter")
p_boxcox_gas = predict(m_boxcox_gas,f_boxcox_gas)
plot(m_boxcox_gas,p_boxcox_gas)
And let’s view the predicted data overlapped our actual data
y = as.numeric(boxcox_gas)
y_hat = p_boxcox_gas$yhat
y_hat_upper = p_boxcox_gas$yhat_upper
y_hat_lower = p_boxcox_gas$yhat_lower
x = p_gas$ds
plot(x,y,type="line",cex=4,main="Box-Cox Predicted vs Actual")
## Warning in plot.xy(xy, type, ...): plot type 'line' will be truncated to first
## character
lines(x,y_hat,col="red",lty=1,cex=4)
lines(x,y_hat_upper,col="lightblue",lty=2,cex=4)
lines(x,y_hat_lower,col="lightblue",lty=2,cex=4)
legend("topleft",legend=c("Actual","Predicted","Confidence Interval"),col=c("black","red","lightblue"),lty=c(1,1,2))
By obersevation, this is a way stronger model. Let’s prove this using the MSE.
We saw earlier the MSE for the original model was:
MSE_gas
## [1] 12710.19
Now using the same metric, our MSE for the log data is:
MSE_log = mean((log_gas-p_log_gas$yhat)^2)
MSE_log
## [1] 0.04502534
This is much closer to zero, signifying that this is a stronger model. As we expected from viewing our plot.
Now the MSE for the Box-Cox transformed data:
MSE_boxcox = mean((boxcox_gas-p_boxcox_gas$yhat)^2)
MSE_boxcox
## [1] 0.0001546818
This value is incredibly close to zero. Therefore this is a incredibly accurate prediction model on the transformed data.
It is safe to say that the Box-Cox transformation removed heteroskedasticity more reliably, and therefore allowed us the make a more accurate prediction using Prophet.
Finally, there are some more components under the Prophet model we
can analyse. To do this, we use the
prophet_plot_components() function.
Let’s look at the components of our first model.
prophet_plot_components(m_train_gas,p_gas)
Since we recorded data quarterly, we can only see the yearly seasonality of the UK Gas Usage.
However, the prophet model can do much more than this. To see the range of analysis that Prophet provides, let’s use an alternative dataset. One with more frequent data recordings.
energy = read.csv("energy_dataset.csv")
This is the hourly estimated consumption of energy in Megawatts for the entire country of Spain.
Let’s see how it looks using a basic plot.
energy_full = energy$y
ts_energy = ts(energy_full,start=2015,frequency=24*365.25)
plot(ts_energy,type="line")
## Warning in plot.xy(xy.coords(x, y), type = type, ...): plot type 'line' will be
## truncated to first character
Again, we must create a prophet model using this data. I will not explain the steps as we have seen this process a few times now. As we are not going to assess accuracy of a model, just analyse the components, I also will not split the data into test and training.
y_energy = energy_full
ds_energy = energy$ds
df_energy = data.frame(ds=ds_energy,y=y_energy)
m_energy = prophet(df_energy)
f_energy = make_future_dataframe(m_energy,1,freq="year")
p_energy = predict(m_energy,f_energy)
prophet_plot_components(m_energy,p_energy)
Now we get much more information. We can extrapolate some data from this
seasonality breakdown:
Let’s view our data in a prophet plot.
plot(m_energy,p_energy)
Notice, due to the high density of data points our plot is very hard
to read on the whole timescale. Let’s use the
library(dygraphs) to manipulate the scale of our plot. This
should make it easier to zoom in and analyse the small-scale
seasonalities and trends.
dyplot.prophet(m_energy,p_energy)
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `select()` instead.
## ℹ The deprecated feature was likely used in the prophet package.
## Please report the issue at <https://github.com/facebook/prophet/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.